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Abstract 

Clustering provides a common means of identifying structure in complex data, and 
there is renewed interest in clustering as a tool for the analysis of large data sets in many 
fields. A natural question is how many clusters are appropriate for the description of a 
given system. Traditional approaches to this problem are based either on a framework 
in which clusters of a particular shape are assumed as a model of the system or on a 
two-step procedure in which a clustering criterion determines the optimal assignments 
for a given number of clusters and a separate criterion measures the goodness of the 
classification to determine the number of clusters. In a statistical mechanics approach, 
clustering can be seen as a trade-off between energy- and entropy-like terms, with lower 
temperature driving the proliferation of clusters to provide a more detailed description 
of the data. For finite data sets, we expect that there is a limit to the meaningful 
structure that can be resolved and therefore a minimum temperature beyond which we 
will capture sampling noise. This suggests that correcting the clustering criterion for 
the bias which arises due to sampling errors will allow us to find a clustering solution 
at a temperature which is optimal in the sense that we capture maximal meaningful 
structure — without having to define an external criterion for the goodness or stability 
of the clustering. We show that, in a general information theoretic framework, the finite 
size of a data set determines an optimal temperature, and we introduce a method for 
finding the maximal number of clusters which can be resolved from the data in the 
hard clustering limit. 



1 Introduction 

Much of our intuition about the world around us involves the idea of clustering: many differ- 
ent acoustic waveforms correspond to the same syllable, many different images correspond to 
the same object, and so on. It is plausible that a mathematically precise notion of clustering 
in the space of sense data may approximate the problems solved by our brains. Clustering 
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methods also are used in many different scientific domains as a practical tool to evaluate 
structure in complex data. Interest in clustering has increased recently because of new areas 
of application, such as data mining, image- and speech-processing and bioinformatics. In 
particular, many groups have used clustering methods to analyze the results of genome-wide 
expression experiments, hoping to discover genes with related functions as members of the 
same cluster; see, for example, Eisen, Spellman, Brown and Botstein (1998). A central issue 
in these and other applications of clustering is how many clusters provide an appropriate 
description of the data. The estimation of the true number of classes has been recognized as 
"one of the most difficult problems in cluster analysis" by Bock (1996), who gives a review 
of some methods that address the issue. 

The goal of clustering is to group data in a meaningful way. This is achieved by optimization 
of a so called "clustering criterion" (an objective function), and a large variety of intuitively 
reasonable criteria have been used in the literature (a summary is given in Gordon, 1999). 
Clustering methods include agglomerative clustering procedures such as described by Ward 
(1963) and iterative re-allocation methods, such as the commonly used K-means algorithm 
(Lloyd, 1957; MacQueen, 1967), which reduces the sum of squares criterion, the average 
of the within cluster squared distances. More recently, algorithms with physically inspired 
criteria were introduced (Blatt, Wiseman and Domany 1996; Horn and Gottlieb, 2002). All 
these clustering methods have in common that the number of clusters has to be found by 
another criterion. Often, a two-step procedure is performed: The optimal partition is found 
for a given data set, according to the defined objective function, and then a separate criterion 
is applied to test the robustness of the results against noise due to finite sample size. Such 
procedures include the definition of an intuitively reasonable criterion for the goodness of the 
classification, as in Tibshirani, Walther and Hastie (2001), or performing cross-validation 
(Stone, 1974) and related methods in order to estimate the prediction error and to find the 
number of clusters that minimizes this error (e.g. Smyth, 2000). Roth, Lange, Braun and 
Buhmann (2002) quantify the goodness of the clustering via a resampling approach. 

It would be attractive if these two steps could be combined in a single principle. In a 
sense this is achieved in the probabilistic mixture model approach, but at the cost of assum- 
ing that the data can be described by a mixture of N c multivariate distributions with some 
parameters that determine their shape. Now the problem of finding the number of clusters 
is a statistical model selection problem. There is a trading between complexity of the model 
and goodness of fit. One approach to model selection is to compute the total probability 
that models with iV c clusters can give rise to the data, and then one finds that phase space 
factors associated with the integration over model parameters serve to discriminate against 
more complex models (Balasubramanian, 1997). This Bayesian approach has been used to 
determine the number of clusters (Fraley and Raftery, 2002). 

From an information theoretic point of view, clustering is most fundamentally a strategy 
for lossy data compression: the data are partitioned into groups such that the data could 
be described in the most efficient way (in terms of bit cost) by appointing a representative 
to each group. The clustering of the data is achieved by compressing the original data into 
representatives, throwing away information that is not relevant to the analysis. While most 
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classical approaches in statistics give an explicit definition of a similarity measure, in rate- 
distortion theory we arrive at a notion of similarity through a fidelity criterion implemented 
by a distortion function (Shannon 1948). The choice of the distortion function provides 
an implicit distinction between relevant and irrelevant information in the raw data. The 
notion of relevance was made explicit by Tishby, Pereira and Bialek (1999) who defined rele- 
vant information as the information which the data provide about an auxiliary variable and 
performed lossy compression, keeping as much relevant information as possible. This formu- 
lation, termed "Information Bottleneck method" (IB), is attractive, because the objective 
function follows only from information theoretical principles. In particular, this formulation 
does not require an explicit definition of a measure for similarity or distortion. The trade-off 
between the complexity of the model on one hand and the amount of relevant information it 
captures on the other hand is regulated by a trade-off parameter. For a given problem, the 
complete range of this trade-off is meaningful, and the structure of the trade-of characterizes 
the "clusterability" of the data. However, for a finite data set, there should be a maximal 
value for this trade-off after which we start to "overfit," and this issue has not yet been 
addressed in the context of the IB. In related work, Buhmann and Held (2000) derived for a 
particular class of histogram clustering models a lower bound on the annealing temperature 
from a bound on the probability of a large deviation between the error made on the training 
data and the expected error. 

In this article, we follow the intuition that if a model — which, in this context, is a (prob- 
abilistic) partition of the data set — captures information (or structure) in the data, then 
we should be able to quantify this structure in a way that corrects automatically for the 
overfitting of finite data sets. Attempts to capture only this "corrected information" will, 
by definition, not be sensitive to noise. Put another way, if we would separate at the outset 
real structure from spurious coincidences due to undersampling, then we could fit only the 
real structure. In the context of information estimation from finite samples there is a signif- 
icant literature on the problem and we argue here that the known finite sample correction 
to information estimates is (in some limits) sufficient to achieve the "one-step" compression 
and clustering in the sense described above, leading us naturally to a principled method of 
finding the best clustering that is consistent with a finite data set. 

We should point out that in general we are not looking for the "true" number of clus- 
ters, but rather for the maximum number of clusters which can be resolved from a finite 
data set. This number equals the true number only if there exists a true number of classes 
and if the data set is also large enough to allow us to resolve them. 

2 Rate distortion theory and the Information Bottle- 
neck Method 

If data x G X are chosen from a probability distribution P(x), then a complete description of 
a single data point requires an average code length equal to the entropy of the distribution, 
S(x) = — J2 X P{ x ) l°g2 [-P^)] bits. On the other hand, if we assign points to clusters c G 
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{1, 2, • • • , N c }, then we need at most log 2 (iV c ) bits. For N c « \X\ we have log 2 (iV c ) << 
S(x), and our intuition is that many problems will allow substantial compression at little 
cost if we assign each x to a cluster c and approximate x by a representative x c . 



Rate distortion theory formalizes the cost of approximating the signal x by x c as the 
expected value of some distortion function, d(x, x c ) (Shannon 1948). This distortion measure 
can, but need not be a metric. Lossy compression is achieved by assigning the data to clusters 
such that the mutual information 



J(c; x) = P{c\x)P{x) log 2 



P{c\x) 

Tic)' 



is minimized. The minimization is constrained by fixing the expected distortion 

{d(x,x c )) = P(c\x)P(x)d(x, x c ). 



(1) 



(2) 



This leads to the variational problem 



min \(d(x, x c )) + TI(c: x)] . 

P(c\x) 



The (formal) solution is a Boltzmann distribution 1 



P(c\x) 



P(c) 
Z(x;T) 



exp 



d{x, x c ) 



with the distortion playing the role of energy, and the normalization 



(3) 



(4) 



(5) 



playing the role of a partition function (Rose, Gurewitz and Fox, 1990). The representatives, 
x c , often simply called cluster centers, are determined by the condition that all of the "forces" 
within each cluster balance for a test point located at the cluster center, 2 



x\c 



_d_ 

dx, 



0. 



(6) 



Recall that if the distortion measure is the squared distance, d(x, x c 



[x — x r 



then Eq. 



© becomes x c = J2 X xP(x\c); the cluster center is in fact the center of mass of the points 
which are assigned to the cluster. 



The Lagrange parameter T regulates the trade-off between the detail we keep and the bit 
lji/ _ ln(2) , because the information is measured in bits in Eq. Q). P(c) is calculated as P(c) — 

2 This condition is not independent of the original variational problem. Optimizing the objective function 
with respect to x c , we find: g|- [(d(x,x c )) +TI(c;x)} = <^> ^-(d(x,x c )) = p(c)J2 x P ( x \ c )^ d ( x ^ x c) 
0: Vc > K(|. © 
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cost we are willing to pay; in analogy with statistical mechanics, T often is referred to as 
the temperature (Rose, Gurewitz and Fox, 1990). T measures the softness of the cluster 
membership. The deterministic limit (T — > 0) is the limit of hard clustering solutions. As 
we lower T there are phase transitions among solutions with different numbers of distinct 
clusters, and if we follow these transitions we can trace out a curve of (d) vs. I(c;x), both 
evaluated at the minimum. This is the rate-distortion curve and is analogous to plotting 
energy vs. (negative) entropy with temperature varying parametrically along the curve. 
Crucially, there is no optimal temperature which provides the unique best clustering, and 
thus there is no optimal number of clusters: more clusters always provide a more detailed 
description of the original data and hence allow us to achieve smaller average values of the 
distortion d(x,x c ), while the cost of the encoding increases. 



Information Bottleneck method The distortion function implicitly selects the features 
that are relevant for the compression. However, for many problems, we know explicitly 
what it is that we want to keep information about while compressing the data, but one can 
not always construct the distortion function that selects for these relevant features. In the 
Information Bottleneck method (Tishby, Pereira and Bialek, 1999) the relevant information 
in the data is defined as information about another variable, v G V. Both x and v are random 
variables and we assume that we know the distribution of co-occurrences, P(x,v). We wish 
to compress x into clusters c, such that the relevant information, i.e. the information about 
v, is maximally preserved. This leads directly to the optimization problem 



max \I(c: v) — TI(c: x)} 

P(c\x) y n 



(7) 



One obtains a solution similar to Eq. (J3J) 
in which the Kullback-Leibler divergence 



cxp 



T 



D KL [P{v\x)\\P{v\c)\ 



D KL [P(v\x)\\P(v\c)} =J2P(v\x) log 2 



'P(v 


x) 


_P(v 





(9) 



emerges in the place of the distortion function (Tishby, Pereira and Bialek, 1999), providing 
a notion of similarity between the distributions P(v\x) and P(v\c), where P(v\c) is given by 



P(v\c) 



P c 



J2P(v\x)P(c\x)P(x). 



(10) 



When we plot I(c;v) as a function of I(c;x), both evaluated at the optimum, we obtain 
a curve similar to the Rate Distortion Curve, the slope of which is given by the trade-off 
between compression and preservation of relevant information: 



51{c; v) 
SI(c; x) 



T. 



(11) 
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3 Finite sample effects 



The formulation above assumes that we know the probability distribution underlying the 
data, but in practice we have access only to a finite number of samples, and so there are 
errors in the estimation of the distribution. These random errors produce a systematic error 
in the computation of the cost function. The idea here is to compute the error perturbatively 
and subtract it from the objective function. Optimization with respect to the assignment 
rule is now by definition insensitive to noise and we should (for the IB) find a value for the 
trade-off parameter T* at which the relevant information is kept maximally. 

The compression problem expressed in Eq. (J7J) gives us the right answer if we evaluate 
the functional (J7J) at the true distribution P(x,v). But in practice we do not know P(x,v), 
instead we have to use an estimate P(x,v) based on a finite data set. We use perturbation 
theory to compute the systematic error in the cost function that results from the uncertainty 
in the estimate. 



We first consider the case that P(x) is known and we have to estimate only the distri- 
bution P(v\x). This is the case in many practical clustering problems, where x is merely an 
index to the identity of samples, and hence P(x) is constant, and the real challenge is to 
estimate P(v \x). In section we discuss the error that comes from uncertainty in P(x) and 
also what happens when we apply this approach to rate-distortion theory. 



Viewed as a functional of P(c\x), I(c;x) can have errors arising only from uncertainty in 
estimating P(x). Therefore, if P(x) is known, then there is no bias in I(c;x). We assume 
for simplicity that v is discrete. Let N be the total number of observations of x and v. For 
a given x, the (average) number of observations of v is then NP(x). We assume that the 
estimate P(v\x) converges to the true distribution in the limit of large data set size N — > oo. 
However, for finite N, the estimated distribution will differ from the true distribution and 
there is a regime in which N is large enough such that we can approximate (compare Treves 
and Panzeri, 1995) 

P(v\x) = P(v\x) +SP(v\x), (12) 

where we assume that 5P(v\x) is some small perturbation and its average over all possible 
realizations of the data is zero 

(5P(v\x)) = 0. (13) 



Taylor expansion of I emp (c;v 
AI(c;v): 

I(c: v 



j M)Ipw*) 



where the error 

AJ(c; v) 



P(v\x)=P(v\x)+5P(v\x) 



E^EE---E 



around P(v \x) leads to a systematic error 

(14) 



I{c;v)\ P ( v \ x ) + AI(c;v 



5 n I(c:v) 



n=l 



III 



c(i) 



with 



S n I{c; v) 



m=isp(v\ x 



(£)> 



(n-2)\ 



njUP(c|*«; 



n^( 

\k=l 



V\X 



E 



(P(c,v)) n ~ l (P(v)) n - 



(15) 



(16) 
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is given by 



Al(c;v) 



E 



In 2 ^ n{n 



(P^v))"" 1 



((E^P(^k)P(x))") 

(p( v ))n-l 



(17) 



Note that the terms with n = 1 vanish, because of Eq. (jlMj) and that the second term in the 
sum is constant with respect to P(c\x). 

Our idea is to subtract this error from the objective function (J7J) and to recompute the 
distribution that maximizes the corrected objective function. 



max 

P(c\x) 



P mp (c; v) - TF mp (c; x) - A/(c; v) + fx(x) ^ P{c\x) 



The last constraint ensures normalization, and the optimal assignment rule P(c\x) is now 
given by 



P{c\x) 



P(c) 
Z{x,T) 



cxp 



1 / °° (—l) n 

Jd kl [P(v\x)\\P(v\c)] + J:Y, { 



T 



v n=2 
n-l\ 



ln(2) 



P(v\x 



((5P(v\c)) n ) {5P(v\x){5P{v\c)) n - 1 ) 



n(P(v\c)) n (n - l)(P(w|c)) n - 1 
which has to be solved self consistently together with Eq. (|1U|) and 

5P(v\c) :=J2 5P ( v \ x ) p ( x \ c ) 



(19) 



(20) 



The error A/(c; v ) is calculated in Eq. ()17|) as an asymptotic expansion and we are assuming 
that N is large enough to ensure that 5P(v\x) is small Vi>. Let us thus concentrate on the 
term of leading order in 5P(v\x) which is given by (disregarding the term which does not 
depend on P(c\x)) 3 

(Alter v)) (2) = - V Z x [P(c\z)] 2 P(x\v) m) 



3 



We arrive at Eq. (|21|l by calculating the first leading order term (n = 2) in the sum of Eq. 117(1 : 



Ex El P{c, x)P{c, x')(6P(v\x)SP(v\x')) 



21n(2)^ P(c,v) 
making use of approximation IL'l'li and summing over x' , which leads to 

[IX1[C - V)) -2\n(2)N^ P(c\v)P(v) 
and then substituting P(v\x)P(x) / P(v) = P(x\v) and P(c\v) = ^2 X P (c\x)P(x\v) (compare Eq. jTHjl). 
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where we have made use of Eq. (|10p and the approximation (for counting statistics) 

(5P(v\x)5P(v\x')} ~ ^^|| . (22) 



Can we say something about the shape of the resulting "corrected" optimal information 
curve by analyzing the leading order error term (Eq. l2~Tj)? This term is bounded from above 
by the value it assumes in the deterministic limit (T — > 0), in which assignments P(c\x) are 
either 1 or and thus [P(c|x)] 2 = P(c|x), 4 

X„ is the number of bins we have used to obtain our estimate P(v\x). Note that if we 
had adopted a continuous, rather than a discrete, treatment, then the volume of the (finite) 
V-spa.ce would arise instead of K v . 5 If one does not make the approximation (}2*2*|) and, in 
addition, also keeps the terms constant in P(c\x), then one obtains for the upper bound 
(T — ► limit) of the n = 2 term from expression (|17j) : 

1 K — 1 

Vc-1) (24) 



21n(2) N 

which is the leading correction to the bias as derived in (Treves and Panzeri, 1995; Eq. 
(2.11); term C\). Similar to what these authors found when they computed higher order 
corrections we also found in numerical experiments that the leading term is a surprisingly 
good estimate of the total bias and we therefore feel confident to approximate the error by 
(121)1 . although we can not guarantee convergence of the series in (117)1 . 6 

Substitution of [P(c\x)f = P(c\x) into Eq. J2D gives (AI (c; v)) {2) = £ oc = 

21n(2)JV Sue = 2\n(2)N-KvN c . 

5 For choosing the number of bins, K v , as a function of the data set size N, we refer to the large body of 
literature on this problem, for example Hall and Hannan (1988). 
6 For counting statistics (binomial distribution) we have 

1 " r;' N k 

((SP(v\x)r) = (PH^))" £(-!)(— 



TV"- 1 ' ' 1 " ^ v ; k\(n-k)\ (P(v\x)) k 

where I = J2q=i^q^ t ne lq are positive integers, and the sum i k } runs over all partitions of k, i.e., 

values of l\, ...,lk such that Y^q=i Q^q = There is a growing number of contributions to the sum at 
each order n, some of which can be larger than the smallest terms of the expression at order n — 1. If 
there are enough measurements such that NP(x) is large, the binomial distribution approaches a normal 
distribution with ((6P(v\x)) 2n ) = (2n - 1)!! Nn p (x)n (P(v\x) - P{v\x) 2 ) n , and {(SPivlx)) 2 "- 1 ) = (n = 

1,2,...). Substituting this into (|17fl . and considering only terms with x' 1 ) = x^ = ... = x^ n \ we get 



J-V PCriAV 00 ( 2fc - 1 ) !! 1 P(x) {P(c\x)f / p , I v p / I 
\xi2 l^xvc r \ L i V ) 2-^k=\ 2k(2k-\) N P(v) (P(c\v))' 2 V^^) ^\V\X 



1 P(x) (P(c\x)) 2 (pr.,\„\ _ p(„,\„\2\ 

N P{v) (P(c\v)y 2 



which is not guaranteed to converge. 
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The lower bound of the leading order error (Eq. |2"T]) is given by 7 

1 1 



2 /(c;x) , (25) 



2 ln(2) N 

and hence the "corrected information curve" , which we define as 

J corr ( c; „) : = r m P( c; v ) - A/(c; v), (26) 
is (to leading order) bounded from above by 

Iub(c; v) = F^( C] v) - ^y^2«. (27) 

The slope of this upper bound is T — 2 T ^/2N (using Eq. Ijlljl ). and there is a maximum 
at 

Tub = ^2 /M0 - ( 2 §) 

If the hard clustering solution assigns equal numbers of data to each cluster, 

then the upper bound on the error, Eq. ()23|) . can be rewritten as 



1 Kv 2^ x \ (29) 



21n(2) N 

and therefore, the lower bound on the information curve, 

I^(c; v) = / cmp (c; v) - ^2^\ (30) 

has a maximum at 

T* = ^2 /(c;x) . (31) 

LB 2N V ) 

Since both upper and lower bound coincide at the endpoint of the curve, where T — ► (see 
sketch in Fig. the actual corrected information curve must have a maximum at 

T* = -^2 1{c ' x) . (32) 

where 1 < 7 < K v . 

In general, for deterministic assignments, the information we gain by adding another cluster 
saturates for large N c , and it is reasonable to assume that this information grows sub-linearly 
in the number of clusters. That means that the lower bound on J corr (c; v) has a maximum 
(or at least a plateau). This ensures us that J corr (c; v) must have a maximum (or plateau), 
and hence that an optimal temperature exists. 



7p mn f V [P(c\x)YP(x\v) _ v p / A P{c\x) v PHs) v p / n P(c|:e) 

riuui. xvc P{c\v) ~ l^xc r \ x ' c ) P( c ) l^vP{v\c) l^xc r \ x ^> P( c ) 
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I(c,x) 

Figure 1: Sketch of the lower and upper bound on the corrected information curve which 
both have a maximum under some conditions (see Eqs. (|28|l and (pH|)). indicated by x-es, 
compared to the empirical information curve which is monotonically increasing. 
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In the context of the IB, asking for the number of clusters that are consistent with the 
uncertainty in our estimation of P(v\x) makes sense only for deterministic assignments. 
From the above discussion, we know the leading order error term in the deterministic limit, 
and we define the "corrected relevant information" in the limit T — > as: 8 

«c; v) = J- P (c; v) - j^^N c , (33) 

where 7^f (c; v) is calculated by fixing the number of clusters and cooling the temperature 
down to obtain a hard clustering solution. While J^f (c; v) increases monotonically with 
N c , we expect I^" (c; v) to have a maximum (or at least a plateau) at N*, as we have argued 
above. N*, is then the optimal number of clusters in the sense that using more clusters, we 
would not capture more meaningful structure (or in other words would "overfit" the data), 
and although in principle we could always use fewer clusters, this comes at the cost of keeping 
less relevant information I(c;v). 



4 Numerical results 

4.1 Simple synthetic test data 

We test our method for finding N* on data that we understand well and where we know what 
the answer should be. We thus created synthetic data drawn from normal distributions with 
zero mean and 5 different variances (for Figs. [^] and I3J) - 9 We emphasize that we chose an 
example with Gaussian distributions not because any of our analysis makes use of Gaussian 
assumptions, but rather because in the Gaussian case we have a clear intuition about the 
similarity of different distributions and hence about the difficulty of the clustering task. This 
will become important later, when we make the discrimination task harder (see Fig. |UJ). We 
use K v = 100 bins to estimate P(v\x). In Figs. [21 and 01 we compare how J^f (c;t>) and 
It ™o{ c 'i v ) behave as a function of the number of clusters. The number of observations of v, 
given x, is N v = N/N x . For a large range of "resolutions" (average number of observations 
per bin), N v /K v , I^ (c; v) has a maximum at N* = 5 (the true number of clusters). When 
we have too little data (N v /K v = 1), we can resolve only 4 clusters. The curves in Figs. 121 
and El are calculated as the mean of the curves obtained from 5 different realizations of the 
data, 10 and the error bars are ± 1 standard deviation. The optimal number of clusters N*, 
as determined by our method, is always the same, for each of these individual curves, so that 
there are no error bars on the optimal number of clusters as a function of the data set size. 
As N v /K v becomes very large, It™o(c;v) approaches I^ (c;v), as expected. 

The curves in Figs. 121 and El differ in the average number of examples per bin, N v /K v . 
8 This quantity is not strictly an information anymore, thus the quotes. 

9 P(x) = l/N x and P(v\x) = Af(0,a(x)) where a(x) £ A, and \A\ = 5, with P(a) = 1/5; and N x = 50. 
N x is the number of "objects" we are clustering. 

10 Each time we compute Z^f (c; v), we start at 100 different, randomly chosen initial conditions to increase 
the probability of finding the global maximum of the objective functional. 
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Figure 2: Result of clustering synthetic data with P(v\x) = Af(0,a(x)); 5 possible values 
for a. Displayed is the relevant information kept in the compression, computed from the 
empirical distribution, I^^ (c; v), which increases monotonically as a function of the number 
of clusters. Each curve is computed as the mean of 5 different realizations of the data 
with error bars of ± 1 standard deviation. N v /K v equals 1 (diamonds), 2 (squares), 3 
(down pointing triangles), 5 (stars), 10 (up pointing triangles), 15 (circles) and 50 (crosses). 
N x = 50 and K v = 100 for all curves. The line is drawn at the value of the information 
I(x;v), estimated from 10 6 data points. 
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Figure 3: Result of clustering synthetic data with P{y\x) = Af(0, cx(x)); 5 possible values for 
a. Displayed is the "corrected relevant information information" in the hard clustering limit, 
-^t°^o( c ' v )-> ( see Eq- ESI) as a function of the number of clusters. Each curve is computed as 
the mean of 5 different realizations of the data with error bars of ± 1 standard deviation. 
N v /K v equals 1 (diamonds), 2 (squares), 3 (down pointing triangles), 5 (stars), 10 (up 
pointing triangles), 15 (circles) and 50 (crosses). All individual curves (not just the means) 
peak at N* = 5, except for those with N v /K v = 1, which peak at N* = 4. N x = 50 and 
K v = 100 for all curves. The line is drawn at the value of the information J(x; v), estimated 
from 10 6 data points. 
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Figure 4: A trivial examples of those data sets on which we found the correct number of 
clusters (results are summarized in Fig. [HJ). Here, P(y\x) = J\f(a(x), 1) with 5 different 
values for a, spaced da = 2 apart. K v = 100, N x = 20, N v /K v = 20. 
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Figure 5: One of the difficult examples of those data sets on which we found the correct 
number of clusters (results are summarized in Fig. |UJ). Here, P(v\x) = Af(a(x),l) with 5 
different values for a, spaced da = 0.2 apart. K v = 100, N x = 20, N v /K v = 20. 
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Figure 6: Result of finding the correct number of clusters with our method for a synthetic 
data set of size iV = N X N V (N x = 20) with P{v\x) = Af(a(x), 1) and with either 2, 5 or 
10 possible values for a, spaced da apart. We indicate values of da and the resolution 
N v /K v (K v = 100) at which the correct number of clusters is found: for 2, 5 and 10 clusters 
(squares); only for 2 and 5 clusters (stars); only for 2 clusters (circles). The classification 
error (on the training data) is for all points except for the one that is labeled with 95% 
correct. 
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The classification problem becomes harder as we see less data. However, it also becomes 
harder when the true distributions are more like each other. To separate the two effects, we 
create synthetic data drawn from Gaussian distributions with unit variance and Na different, 
equidistant means a, which are da apart. 11 Na is the true number of clusters. This problem 
becomes intrinsically harder as da becomes smaller. Examples are shown in Figs. 0] and 03 
The problem becomes easier as we are allowed to look at more data, which corresponds to 
an increase in N v /K v . We are interested in the regime in the space spanned by N v /K v and 
da in which our method retrieves the correct number of clusters. 

In Fig. El points mark those values of da and N v /K v (evaluated on the shown grid) at 
which we find the true number of clusters. The different shapes of the points summarize 
results for 2, 5 and 10 clusters. A missing point on the grid indicates a value of da and 
N v /K v at which we did not find the correct number of clusters. All these missing points lie 
in a regime which is characterized by a strong overlap of the true distributions combined with 
scarce data. In that regime, our method always tells us that we can resolve fewer clusters 
than the true number of clusters. For small sample sizes, the correct number of clusters is 
resolved only if the clusters are well separated, but as we accumulate more data, we can 
recover the correct number of classes for more and more overlapping clusters. To illustrate 
the performance of the method, we show in Fig. El the distribution P(x,v) in which a(x) 
has 5 different values which occur with equal probability, P(a(x)) = 1/5 and which differ by 
da = 0.2. For this separation, our method still retrieves 5 as the optimal number of clusters 
when we have N v = 2000 observations per example x. 12 

Our method detects when only one cluster is present, a case in which many methods fail 
(Gordon, 1999). We verified this for data drawn from one Gaussian distribution and for data 
drawn from the uniform distribution. 

4.2 Synthetic test data that explicitly violates mixture model as- 
sumptions 

We consider data drawn from a radial normal distribution, according to P(r) = A/"(l,0.2), 
with x = rcos((f)), v = rsin(4>), and P{4>) = 1/2-7T, as shown in Fig. [7| The empirical 
information curves (Fig. |HJ) and corrected information curves (Fig. EJ) are computed as the 
mean of 7 different realizations of the data for different sample sizes. 13 The corrected curves 
peak at N*, which is shown as a function of N in Fig. For less than a few thousand 
samples, the optimal number of clusters goes roughly as N* oc iV 2 / 3 , but there is a saturation 
around N* « 25. This number corresponds to half of the number of x-bins (and therefore half 
of the number of "objects" we are trying to cluster) which makes sense given the symmetry 
of the problem. 

n P(v\x) = M(a(x), 1), a{x) e A, N A := \A\. 

12 We used K v = 100 bins to estimate P(v\x).The distribution of examples is simply P(x) = 1/N X with 
N x = 20. 

13 Each time we compute Z^f (c; v), we start at 20 different, randomly chosen initial conditions to increase 
the probability of finding the global maximum of the objective functional. Increasing the number of initial 
conditions would decrease the error bars at the cost of computational time. 
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Figure 7: 20000 data points drawn from a radial distribution, according to P(r) = 0.2), 
with x = rcos((p), v = rsin((p), P((p) = l/2n. Displayed is the estimated probability 
distribution (normalized histogram with 50 bins along each axis). 
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Figure 8: Z^f (c; v) as a function of the number of clusters, averaged over 7 different realiza- 
tions of the data. Error bars are ± 1 standard deviation. The information I(x; v) calculated 
from 100000 data points is 0.58 bits (line). Data set size N is: 100 (diamonds), 300 (squares), 
1000 (down pointing triangles), 3000 (stars), 10000 (up pointing triangles), 30000 (circles), 
100000 (crosses). 
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Figure 9: /£™ r (c; v) as a function of the number of clusters, averaged over 7 different realiza- 
tions of the data. Error bars are ± 1 standard deviation. The information I(x; v) calculated 
from 100000 data points is 0.58 bits (line). Data set size N is: 100 (diamonds), 300 (squares), 
1000 (down pointing triangles), 3000 (stars), 10000 (up pointing triangles), 30000 (circles), 
100000 (crosses). 
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Figure 10: Optimal number of clusters, N*, as found by the suggested method, as a function 
of the data set size N. The middle curve (crosses) represents the average over 7 different 
realizations of the data, points on the upper/lower curve ± 1 standard deviation. Line at 
25. 
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5 Uncertainty in P(x) 



In the most general case, x can be a continuous variable drawn from an unknown distribution 
P(x). We then have to estimate the full distribution P(x, v) and if we want to follow the same 
treatment as above, we have to assume that our estimate approximates the true distribution 

P(v,x) = P(v,x) +8P(v,x), (34) 

where 5P(v, x) is some small perturbation and its average over all possible realizations of 
the data is zero 

(5P(v,x)) = 0. (35) 

Now, this estimate induces an error not only in I emp (c;v), but also in I emp (c;x). Taylor 
expansion of these two terms gives 

Arr v i v ~ (~ir / i i \ 

{C ' V> " \n(2)^^ 2 n(n-l)[(P(v,c))^ (P(c))^ ) 

X ((Y1 P(c\x)5P(x, v)^j \ - A(P(v)) (36) 

A(PW) = ^?IS^f((? ,p( -')*) (37) 

1 00 (— l) n 1 

A/(c;x) = "i^j?Sn(n-l)(P(r 

x^^P( C |ar)5P(x,^)^ \ (38) 
This results in a correction to the objective function (f7 1corrected = p em P _ AF), given by: 

AF = H2) 5 £ n{n - 1) (P(c))-i ( (P(v\c))^ +T ~ 1 ) (39) 
x(^(j2P(c\x)5P(x,v)j ^-A(P(v)), 

where A(P(v)) is constant in P(c\x) and therefore not important. At critical temperature 
T = 1, the error due to uncertainty in P(x) made in calculating J emp (c; v ) cancels that made 
in computing I emp (c;x). For small T, the largest contribution to the error is given by the 
first term in the sum of Eq. (j3Uj) . since l/(P(v\c)) n > 1, V{n, v, c}. Therefore, the procedure 
that we have suggested for finding the optimal number of clusters in the deterministic limit 
(T — > 0) remains unchanged, even if P(x) is unknown. Let us consider, as before, the leading 
order term of the error (using the approximation in Eq. (|2*2*|) ) 



(AF) (2) = ^- Y 4t ( -^rr +T-l) Y(P(c\x)) 2 P 

1 ; 2AHn(2) P(c) \p(v\c) 1 1 

In the T — > limit this term becomes N C (K V — l)/2iVln(2), and we find, 



(x,v). (40) 



«c; v) = i£*( C ; v) - ^^N c , (41) 
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which is insignificantly different from Eq. (|33J) in the regime K v >> 1. 



Only for very large temperatures T » 1 (i.e. at the onset of the annealing process) could 
the error that results from uncertainty in P(x) make a significant difference. 



The corrected objective function is now given by 

F corr = JC mp( c . ^ _ T/ emp^. ^ _ AF _ ^ £ p{p\x), 

c 

and the optimal assignment rule is given by 



P(c\x) 



P(c) 
Z(x,T) 

1 

4 



cxp 



f[D KL [P(c\.r)\\P[r\c)\ 



EE(-i) n 



ln(2)P(x) V^=2 
x((6P(v,c)T) - 



1 



1 



n(P(c)) n \ (P(v\c)) n 
1 ( 1 



(n-l)(P(c)) n - 1 {(Pivlc))™' 1 
x(5P(x,v)(5P(v,c)) n - 1 )^j , 

which has to be solved self consistently together with Eq. (fTU|) and 

5P(v,c) :=Y,5P{v,x)P(c\x) 



+ T-1J 
T - 1 



(42) 



(43) 



(44) 



Rate— distortion theory. Let us assume that we estimate the distribution P(x) by P(x) = 
P(x) + 5P(x), with (5P(x)) = 0, as before. While there is no systematic error in the 
computation of (d), this uncertainty in P(x) does produce a systematic under estimation of 
the information cost I(c:x): 



AI(c;x) 



-I CO 

E 



■1)" y^{(E x P(c\x)5P(x)T) 



ln(2) ^ 2 n(n - 1) ^ (^(c))™" 1 
When we correct the cost functional for this error (with A = 1/T), 

F corr ;= j^. ^ + A ^^ X) _ A/ ( c . x j + ^ £ p( c | x ) ; 

c 

we obtain for the optimal assignment rule (with A' = Aln(2)), 



(45) 



(46) 



P(c\x) 



P(c) 
Z(x,X) 



exp 



-X'd(x,x c ) + 



((Y, x P(c\x)6P(x)T) 



r„-2\ "(/'('•!)" 

1 (5P(x)(X x P(c\x)6P(x)) n - 1 



(47) 



P{x) (n - 1) (P(c)y 

Let us consider the leading order term of the error made in calculating the information cost, 



(A/(c;x)) (2) 



1 



21n(2)iV 
23 



E 



(Eg (P(c|x)(5P(x)) 2 ) 
P(c) 



(48) 



For counting statistics, we can approximate, as before, 



(ATfr-r))®- 1 V — ( P ( c \ x )) 2 P ( x ) (A a) 

(A/(C,;r)) ~ 2H2)N^ P{6) • (49) 

The information cost is therefore underestimated by at least 2 I ^ C '^ /2 bits. 14 The corrected 
rate-distortion curve with 

J COTr ( c; x ) ■= j( c; x ) - A/(c; x) (50) 
is then bounded from below by 

I?£(c; x) = J( C ; x) + _L_2'fe*> (51) 

and this bound has a rescaled slope given by 

A = V (l - 2^2'^)) (52) 

but no extremum. Since there is no optimal trade-off, it is not possible to use the same 
arguments as we have used before to determine an optimal number of clusters in the hard 
clustering limit. To do this, we have to carry the results obtained from the treatment of 
the finite sample size effects in the IB over to rate-distortion theory. This is possible, with 
insights we have gained in Still, Bialek and Bottou (2004) about how to use the IB for data 
that are given with some measure of distance (or distortion). 



6 Summary 

Clustering, as a form of lossy data compression, is a trade-off between the quality and com- 
plexity of representations. In general, a data set (or clustering problem) is characterized by 
the whole structure of this trade-off — the rate-distortion curve or the information curve in 
the IB method — which quantifies our intuition that some data are more clusterable than 
others. In this sense there is never a single "best" clustering of the data, just a family of 
solutions evolving as a function of temperature. 

As we solve the clustering problem at lower temperatures, we find solutions that reveal 
more and more detailed structure and hence have more distinct clusters. If we have only 
finite data sets, however, we expect that there is an end to the meaningful structure that 
can be resolved — at some point separating clusters into smaller groups just corresponds to 
fitting the sampling noise. The traditional approach to this issue is to solve the clustering 
problem in full, and then to test for significance or validity of the results by some additional 
statistical criteria. What we have presented in this work is, we believe, a new approach: 
Because clustering is formulated as an optimization problem, we can try to take account of 

"Using J2 xc P(x,c)^> =E ;cc P(x, C )2 lo ^ [i W 1 >2^). 
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the sampling errors and biases directly in the objective functional. In particular, for the In- 
formation Bottleneck method all terms in the objective functional are mutual informations, 
and there is a large literature on the systematic biases in information estimation. There 
is a perturbative regime in which these biases have a universal form and can be corrected. 
Applying these corrections, we find that at fixed sample size the trade-off between com- 
plexity and quality really does have an endpoint beyond which lowering the temperature or 
increasing the number of clusters does not resolve more relevant information. We have seen 
numerically that in model problems this strategy is sufficient to set the maximum number 
of resolvable clusters at the correct value. 
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